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KERNEL DENSITY ESTIMATION VIA DIFFUSION 

By Z. I. Botev 1 , J. F. Grotowski and D. P. Kroese 1 

University of Queensland 

We present a new adaptive kernel density estimator based on 
linear diffusion processes. The proposed estimator builds on existing 
ideas for adaptive smoothing by incorporating information from a pi- 
lot density estimate. In addition, we propose a new plug-in bandwidth 
selection method that is free from the arbitrary normal reference rules 
used by existing methods. We present simulation examples in which 
the proposed approach outperforms existing methods in terms of ac- 
curacy and reliability. 

1. Introduction. Nonparametric density estimation is an important tool 
in the statistical analysis of data. A nonparametric estimate can be used, 
for example, to assess the multimodality, skewness, or any other structure 
in the distribution of the data [47, 49] . It can also be used for the summa- 
rization of Bayesian posteriors, classification and discriminant analysis [50]. 
Nonparametric density estimation has even proved useful in Monte Carlo 
computational methods, such as the smoothed bootstrap method and the 
particle filter method [11]. Nonparametric density estimation is an alterna- 
tive to the parametric approach, in which one specifies a model up to a small 
number of parameters and then estimates the parameters via the likelihood 
principle. The advantage of the nonparametric approach is that it offers a far 
greater flexibility in modeling a given dataset and, unlike the classical ap- 
proach, is not affected by specification bias [37]. Currently, the most popular 
nonparametric approach to density estimation is kernel density estimation 
(see [47, 50, 53]). 

Despite the vast body of literature on the subject, there are still many 
contentious issues regarding the implementation and practical performance 
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of kernel density estimators. First, the most popular data-driven bandwidth 
selection technique, the plug-in method [26, 48], is adversely affected by the 
so-called normal reference rule [10, 25], which is essentially a construction of 
a preliminary normal model of the data upon which the performance of the 
bandwidth selection method depends. Although plug-in estimators perform 
well when the normality assumption holds approximately, at a conceptual 
level the use of the normal reference rule invalidates the original motivation 
for applying a nonparametric method in the first place. 

Second, the popular Gaussian kernel density estimator [42] lacks local 
adaptivity, and this often results in a large sensitivity to outliers, the presence 
of spurious bumps, and in an overall unsatisfactory bias performance — a 
tendency to flatten the peaks and valleys of the density [51]. 

Third, most kernel estimators suffer from boundary bias when, for ex- 
ample, the data is nonnegative — a phenomenon due to the fact that most 
kernels do not take into account specific knowledge about the domain of the 
data [41, 44]. 

These problems have been alleviated to a certain degree by the introduc- 
tion of more sophisticated kernels than the simple Gaussian kernel. Higher- 
order kernels have been used as a way to improve local adaptivity and reduce 
bias [28], but these have the disadvantages of not giving proper nonnegative 
density estimates, and of requiring a large sample size for good performance 
[42]. The lack of local adaptivity has been addressed by the introduction of 
adaptive kernel estimators [1, 15, 16, 27]. These include the balloon estima- 
tors, nearest neighbor estimators and variable bandwidth kernel estimators 
[39, 51], none of which yield bona fide densities, and thus remain somewhat 
unsatisfactory. Other proposals such as the sample point adaptive estima- 
tors are computationally burdensome (the fast Fourier transform cannot be 
applied [49] ) , and in some cases do not integrate to unity [44] . The boundary 
kernel estimators [24], which are specifically designed to deal with boundary 
bias, are either not adaptive away from the boundaries or do not result in 
bona fide densities [22]. Thus, the literature abounds with partial solutions 
that obscure a unified comprehensive framework for the resolution of these 
problems. 

The aim of this paper is to introduce an adaptive kernel density estimation 
method based on the smoothing properties of linear diffusion processes. The 
key idea is to view the kernel from which the estimator is constructed as the 
transition density of a diffusion process. We utilize the most general linear 
diffusion process that has a given limiting and stationary probability den- 
sity. This stationary density is selected to be either a pilot density estimate 
or a density that the statistician believes represents the information about 
the data prior to observing the available empirical data. The approach leads 
to a simple and intuitive kernel estimator with substantially reduced asymp- 
totic bias and mean square error. The proposed estimator deals well with 
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boundary bias and, unlike other proposals, is always a bona fide probability 
density function. We show that the proposed approach brings under a single 
framework some well-known bias reduction methods, such as the Abramson 
estimator [1] and other variable location or scale estimators [7, 18, 27, 46]. 

In addition, the paper introduces an improved plug-in bandwidth selec- 
tion method that completely avoids the normal reference rules [25] that have 
adversely affected the performance of plug-in methods. The new plug-in 
method is thus genuinely "nonparametric," since it does not require a pre- 
liminary normal model for the data. Moreover, our plug-in approach does 
not involve numerical optimization and is not much slower than computing 
a normal reference rule [4]. 

The rest of the paper is organized as follows. First, we describe the Gaus- 
sian kernel density estimator and explain how it can be viewed as a special 
case of smoothing using a diffusion process. The Gaussian kernel density 
estimator is then used to motivate the most general linear diffusion that 
will have a set of essential smoothing properties. We analyze the asymptotic 
properties of the resulting estimator and explain how to compute the asymp- 
totically optimal plug-in bandwidth. Finally, the practical benefits of the 
model are demonstrated through simulation examples on some well-known 
datasets [42]. Our findings demonstrate an improved bias performance and 
low computational cost, and a boundary bias improvement. 



2. Background. Given N independent realizations Af/v = {X\, . . . ,Xn} 
from an unknown continuous probability density function (p.d.f.) / on S£ , 

the Gaussian kernel density estimator is defined as 

1 N 

(1) f(x;t) = -J2^ x ' X ^ xGM ' 

i=l 

where 

cf ) (x,X i ;t) = ^e-^- x ^/W 
V 2vrt 

is a Gaussian p.d.f. (kernel) with location X{ and scale \ft. The scale is 
usually referred to as the bandwidth. Much research has been focused on the 
optimal choice of t in (1), because the performance of / as an estimator of 
/ depends crucially on its value [26, 48]. A well-studied criterion used to 
determine an optimal t is the Mean Integrated Squared Error (MISE), 

MISE{/}(t)=E/ J [f(x;t)-f(x)] 2 dx, 
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which is conveniently decomposed into integrated squared bias and inte- 
grated variance components: 

MISE{/}(*) = J ( E f jf(x; t)] - f(x)f dx + J Yar f [f(x;t)} dx. 

pointwise bias of / pointwise variance of / 

Note that the expectation and variance operators apply to the random sam- 
ple X??. The MISE depends on the bandwidth y/i and / in a quite com- 
plicated way. The analysis is simplified when one considers the asymptotic 
approximation to the MISE, denoted AMISE, under the consistency re- 
quirements that t = ijv depends on the sample size N such that tjy i and 
Nt/En — >■ oo as N — > oo , and / is twice continuously differentiable [48] . The 
asymptotically optimal bandwidth is then the minimizer of the AMISE. 
The asymptotic properties of (1) under these assumptions are summarized 
in Appendix A. 

A key observation about the Gaussian kernel density estimator (1) is that 
it is the unique solution to the diffusion partial differential equation (PDE) 

(2) d\^ x ' t) = \^ {x]t) ' *e#",*>o, 

with SC = R and initial condition f(x\ 0) = A(x), where A(x) = i Yli=i ^( x ~ 
Xi) is the empirical density of the data Xn [here 5(x — Xi) is the Dirac mea- 
sure at Xi]. Equation (2) is the well-known Fourier heat equation [36]. This 
link between the Gaussian kernel density estimator and the Fourier heat 
equation has been noted in Chaudhuri and Marron [6]. We will, however, 
go much further in exploiting this link. In the heat equation interpretation, 
the Gaussian kernel in (1) is the so-called Green's function [36] for the dif- 
fusion PDE (2). Thus, the Gaussian kernel density estimator f(x;t) can be 
obtained by evolving the solution of the parabolic PDE (2) up to time t. 

To illustrate the advantage of the PDE formulation over the more tra- 
ditional formulation (1), consider the case where the domain of the data is 
known to be X = [0,1]. It is difficult to see how (1) can be easily modified to 
account for the finite support of the unknown density. Yet, within the PDE 
framework, all we have to do is solve the diffusion equation (2) over the 
finite domain [0, 1] with initial condition A(cc) and the Neumann boundary 
condition 



d_ 

dx" 



= o. 

x=0 



dx" 

The boundary condition ensures that A J^- f(x;t)dx = 0, from where it 

follows that j^- f(x;t) dx = f^- f(x; 0) dx = 1 for all t > 0. The analytical 
solution of this PDE in this case is [3] 

1 N 

(3) f(x;t) = -J2<^Xi;t), xe[0,l], 
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where the kernel n is given by 

oo 

(4) K(x,Xf,t)= ^2 <t>{x,2k + Xi\t) + <t>{x,2k- Xf,t), see [0,1]. 

k=— oo 

Thus, the kernel accounts for the boundaries in a manner similar to the 
boundary correction of the reflection method [49]. We now compare the 
properties of the kernel (4) with the properties of the Gaussian kernel (j> in 
(!)• 

First, the series representation (4) is useful for deriving the small band- 
width properties of the estimator in (3). The asymptotic behavior of n(x,Xf,t) 
as t —> in the interior of the domain [0, 1] is no different from that of the 
Gaussian kernel, namely, 

oo 

<f>{x,2k + Xi;t) + <l>(x,2k-Xi;t)~<j>(x,Xi;t), t±0, 

k=—oo 

for any fixed x in the interior of the domain [0,1]. Here q(t) ~ z(t),t J, to 
stands for lim^ jkl = 1. Thus, for small t, the estimator (3) behaves like 
the Gaussian kernel density estimator (1) in the interior of [0,1]. Near the 
boundaries at x = 0,1, however, the estimator (3) is consistent, while the 
Gaussian kernel density estimator is inconsistent. In particular, a general 
result in Appendix D includes as a special case the following boundary prop- 
erty of the estimator (3): 

E f f{x N ;t N ) = f{x N ) + 0(VUj), iV^oo, 

where x/v = crfv for some a G [0, 1], and ijv4- as N — > oo. This shows that 
(3) is consistent at the boundary x = 0. Similarly, (3) can be shown to be 
consistent at the boundary x = 1. In contrast, the Gaussian kernel density 
estimator (1) is inconsistent [53] in the sense that 

E//(0;tj V ) = |/(0) + O(v / ^), iV^oo. 

The large bandwidth behavior (t —> oo) of (3) is obtained from the follow- 
ing equivalent expression for (4) (see [3]): 

oo 

(5) n{x,Xi-t)= e- k2 ^ t/2 cos{knx)cos{kTiXi). 

k=—oo 

From (5), we immediately see that 

(6) K(x,X i ;t)~l + 2e" 7r2 * /2 cos(7rx)cos(7rX i ), t — >■ oo, x £ [0, 1]. 

In other words, as the bandwidth becomes larger and larger, the kernel (4) 
approaches the uniform density on [0, 1] . 
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Remark 1. An important property of the estimator (3) is that the 
number of local maxima or modes is a nonincreasing function of t. This 
follows from the maximum principle for parabolic PDE; see, for example, 
[36]. 

For example, a necessary condition for a local maximum at, say, (xq, io)>^o > 
0,x £ (0, 1) is -§^rf(x ;t ) < 0. From (2), this implies ^f(x ;t ) < 0, from 

which it follows that there exists an e > such that /(xo;io) — /(^o^o + 
As a consequence of this, as t becomes larger and larger, the number of 
local maxima of (3) is a nonincreasing function. This property is shared by 
the Gaussian kernel density estimator (1) and has been exploited in various 
ways by Silverman [49]. 

Example 1. Figure 1 gives an illustration of the performance of estima- 
tors (3) and (1), where the true p.d.f. is the beta density 4(1 — x) 3 ,x € [0, 1], 
and the estimators are build from a sample of size N = 1000 with a common 
bandwidth yt = 0.05248. Note that the Gaussian kernel density estimator is 
close to half the value of the true p.d.f. at the boundary x = 0. Overall, the 
diffusion estimator (3) is much closer to the true p.d.f. The proposed estima- 
tor (3) appears to be the first kernel estimator that does not use boundary 
transformation and yet is consistent at all boundaries and remains a genuine 
p.d.f. (is nonnegative and integrates to one). Existing boundary correction 
methods [19, 31, 32] either account for the bias at a single end-point, or the 
resulting estimators are not genuine p.d.f.'s. 

Remark 2. In applications such as the smoothed bootstrap [11], there 
is a need for efficient random variable generation from the kernel density 
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estimate. Generation of random variables from the kernel (4) is easily ac- 
complished using the following procedure. Generate Z~ N(0,i) and let 
Y = Xi + Z. Compute W = Ymod2, and let X = \ W\. Then it is easy to 
show (e.g., using characteristic functions) that X has the density given by 
(4). 

Given the nice boundary bias properties of the estimator that arises as the 
solution of the diffusion PDE (2), it is of interest to investigate if equation 
(2) can be somehow modified or generalized to arrive at an even better 
kernel estimator. This motivates us to consider in the next section the most 
general linear time-homogeneous diffusion PDE as a starting point for the 
construction of a better kernel density estimator. 

3. The diffusion estimator. Our extension of the simple diffusion model 

(2) is based on the smoothing properties of the linear diffusion PDE 

(7) ^- t g{x;t)=Lg(x;t), x£%~,t>0, 

where the linear differential operator L is of the form \^i a { x )^{^£j))-, 
and a and p can be any arbitrary positive functions on with bounded 
second derivatives, and the initial condition is g(x,0) = A(x). If the set SC 
is bounded, we add the boundary condition ( 9 ^£) ) = on d!% , which 
ensures that the solution of (7) integrates to unity. The PDE (7) describes 
the p.d.f. of Xt for the Ito diffusion process (Xt,t > 0) given by [12] 

(8) dX t = n(X t )dt + a(Xt)dBt, 

where the drift coefficient fi(x) = 2p(x) ' * ne diffusion coefficient a(x) = J 
the initial state Xq has distribution A(x), and (Bt,t > 0) is standard Brown- 
ian motion. Obviously, if a = 1 and p = 1, we revert to the simpler model (2). 
What makes the solution g(x;t) to (7) a plausible kernel density estimator is 
that g(x;t) is a p.d.f. with the following properties. First, g(-;0) is identical 
to the initial condition of (7), that is, to the empirical density A(x). This 
property is possessed by both the Gaussian kernel density estimator (1) and 
the diffusion estimator (3). Second, if p(x) is a p.d.f. on ^T, then 

lim g(x; t) = p(x), x £ X. 

t—>oo 

This property is similar to the property that the kernel (6) and the estimator 

(3) converge to the uniform density on X = [0, 1] as t — > oo. In the context of 
the diffusion process governed by (8), p is the limiting and stationary density 
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of the diffusion. Third, similar to the estimator (3) and the Gaussian kernel 
density estimator (1), we can write the solution of (7) as 



(9) g(x;t) = ^J2 K (x,X l ;t), 

i=l 

where for each fixed y S the diffusion kernel k satisfies the PDE 



d 

/ 10 \ i Qj_K{x,y;t) =LK[x,y;t), xe&,t>0, 

k(x, y; 0) = 5(x — y), x G SC. 

In addition, for each fixed x £ X the kernel k satisfies the PDE 



[11) 



—K(x,y;t) = L*n(x,y;t), yGJT,t>0, 
jt(x,y;0) = 5(x-y), V^X, 



where L* is of the form 2 p(y) ^( a (^)^('))i that is, L* is the adjoint operator 
of L. Note that L* is the infinitesimal generator of the ltd diffusion process in 
(8). If the set has boundaries, we add the Neumann boundary condition 



9 ( K(x,y;t) 



dx \ p{x) 



= Vt > 



and -^K{x,y;t)\ y& Q3: = to (10) and (11), respectively. These boundary 
conditions ensure that g(x;t) integrates to unity for all t > 0. The reason that 
the kernel k satisfies both PDEs (10) and (11) is that (10) is the Kolmogorov 
forward equation [12] corresponding to the diffusion process (8), and (11) is 
a direct consequence of the Kolmogorov backward equation. We will use the 
forward and backward equations to derive the asymptotic properties of the 
diffusion estimator (9). Before we proceed with the asymptotic analysis, we 
illustrate how the model (7) possesses adaptive smoothing properties similar 
to the ones possessed by the adaptive kernel density estimators [1, 15, 16, 27]. 



Example 2. Suppose that the initial condition of PDE (7) is A(x) 
with N = 500,000 and X±, . . . ,Xn are independent draws from f(x) = 1 — 
cos(67T2;),x G [0,1]. Suppose further that p(x) = 4(1 — x) 3 and a(x) = 1 on 
[0, 1]. The aim of this example is not to estimate /, but to illustrate the var- 
ious shapes that the estimator can take, given data from /. Figure 2 shows 
the solution of the PDE (7) for two values of the bandwidth: \/t = 4 x 10~ 4 
(small) and \ft = 0.89 (large). Since p(x) is the limiting and stationary den- 
sity of the diffusion process governed by (7), the large bandwidth density is 
indistinguishable from p(x). The small bandwidth density estimate is much 
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Fig. 2. Small and large bandwidth behavior of the diffusion density in Example 2. 

closer to f{x) than to p(x). The crucial feature of the small bandwidth den- 
sity estimate is that p(x) allows for varying degrees of smoothing across the 
domain of the data, in particular allowing for greater smoothing to be ap- 
plied in areas of sparse data, and relatively less in the high density regions. 
It can be seen from Figure 2 that the small time density estimate is noisier in 
regions where p(x) is large (closer to x = 0), and smoother in regions where 
p(x) is small (closer to x = 1). The adaptive smoothing is a consequence of 
the fact that the diffusion kernel (10) has a state-dependent diffusion coef- 
ficient a(x) = y / a(x)/p(x), which helps diffuse the initial density A(x) at a 
different rate throughout the state space. 

Remark 3. Even though there is no analytical expression for the diffu- 
sion kernel satisfying (10), we can write k in terms of a generalized Fourier 
series in the case that is bounded: 

oo 

(13) K(x,y;t) =p(x)^2e Xkt tp k (x)tpk(y), x, ye [0,1], 

fc=o 

where and are the eigenfunctions and eigenvalues of the Sturm- 
Liouville problem on [0,1]: 

L*ip k = Xkfk, k = 0,1,2,..., 

(14) 

¥4(0) = ^(1)=0, £ = 0,1,2,.... 

It is well known (see, e.g., [36]) that {(pk} forms a complete orthonormal basis 
with respect to the weight p for L 2 (0, 1). From the expression (13), we can see 
that the kernel satisfies the detailed balance equation for a continuous-time 
Markov process [12] 

(15) p(y)K{x,y;t) =p(x)n(y,x;t) Vi > 0,x,y e X . 





small bandwidth 
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The detailed balance equation ensures that the limiting and stationary den- 
sity of the diffusion estimator (9) is p{x). In addition, the kernel satisfies the 
Chapman-Kolmogorov equation 



JX 

Note that there is no loss of generality in assuming that the domain is 
[0,1], because any bounded domain can be mapped onto [0,1] by a linear 
transformation. 

Remark 4. When p{x) is a p.d.f., an important distance measure be- 
tween the diffusion estimator (9) and p(x) is the divergence measure of 
Csiszar [9]. The Csiszar distance measure between two continuous probabil- 
ity densities g and p is defined as 



where ip : M+ — > R_|_ is a twice continuously differentiable function; tp(l) = 0; 
and ip"{x) > for all x € R+. The diffusion estimator (9) possesses the 
monotonicity property 



In other words, the distance between the estimator (9) and the stationary 
density p is a monotonically decreasing function of the bandwidth y/i. This is 
why the solution of (7) in Figure 2 approaches p as the bandwidth becomes 
larger and larger. Note that Csiszar's family of measures subsumes all of 
the information-theoretic distance measures used in practice [21, 30]. For 
example, if ip{x) = ^^Efj ■> a 7^ 0, 1, for some parameter a, then the family of 
distances indexed by a includes the Hellinger distance for a = 1/2, Pearson's 
X 2 discrepancy measure for a = 2, Neymann's x 2 measure for a = — 1, the 
Kullback-Leibler distance in the limit as a — > 1 and Burg's distance as a — > 0. 

4. Bias and variance analysis. We now examine the asymptotic bias, 
variance and MISE of the diffusion estimator (9). In order to derive the 
asymptotic properties of the proposed estimator, we need the small band- 
width behavior of the diffusion kernel satisfying (10). This is provided by 
the following lemma. 

Lemma 1. Assume that the functions a(x) and p(x) are such that 



(16) 








a 




KERNEL DENSITY ESTIMATION VIA DIFFUSION 



11 



(17) 



lim 



Jz 



ds = oo. 



Then, the leading small bandwidth asymptotic behavior of the kernel satisfy- 
ing (10) and (11) on SC = R is 



K(x,y;t) 



p(x) 



/ 2Tri[p{x)a{x)a(y)p(y)} 1 / i 



x exp< 

I 2t 




no. 



We denote the asymptotic approximation on the right-hand side by n(x,y;t). 
Thus, K,(x,y;t) ~K,(x,y;t) as 1 10. 

The somewhat lengthy and technical proof is given in Appendix B. A few 
remarks about the technical conditions on a and p now follow. Conditions 
(17) are trivially satisfied if a,p and its derivatives up to order 2 are all 
bounded from above, and p(x) > po > and a(x) > ao > 0. In other words, 
if we clip p(x) away from zero and use a(x) =p a (x) for a £ [0,1], then the 
conditions (17) are satisfied. Such clipping procedures have been applied in 
the traditional kernel density estimation setting, see [1, 7, 16, 18, 27]. Note 
that the conditions are more easily satisfied when p is heavy-tailed. For 
example, if a{x) =p(x), then p could be any regularly varying p.d.f. of the 
form p oc (1 + |x|) _Q! , a > 1. Lemma 1 is required for deriving the asymptotic 
properties of the estimator, all collected in the following theorem. 



Theorem 1. Let t = t]y be such that limjv^oo tpj = 0, lini7v->oo ^V^n = 

00. Assume that f is twice continuously differ entiable and that the domain 
3£ = R. Then: 

1. The pointwise bias has the asymptotic behavior 

(18) E f [g(x;t)}-f(x) = tLf(x) + 0(t 2 ), A^oo. 

2. The integrated squared bias has the asymptotic behavior 



(19) ||E / [ & (.;t)]-/|| a ~t a ||L/|| a = Jt 2 ||(a(//p)') , || a , A- 



3. The pointwise variance has the asymptotic behavior 

fix) 



(20) 



Vax f [g(x;t)] 



2AyW(x) ' 



A-^oo, 



oo. 



where a 2 (x) = a(x)/p(x). 
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4. The integrated variance has the asymptotic behavior 

(21) [ V&v f [g(x; t)] dx ~ , A^oo. 

5. Combining the leading order bias and variance terms gives the asymptotic 
approximation to the MISE 

(22) AMISEM(t) = -A\{a{f/ P )'^ ■ 



4 2ArV7rt 
6. Hence, the square of the asymptotically optimal bandwidth is 



(23) t* : 

which gives the minimum 



\2Ny/Z\\Lf\\*J ' 
4/5 



(24) minAMISEM(t)=iV 
The proof is given in Appendix C. 

We make the following observations. First, if p ^ /, the rate of convergence 
of (24) is 0(A~ 4 / 5 ), the same as the rate of the Gaussian kernel density 
estimator in (39). The multiplicative constant of A -4 / 5 in (24), however, 
can be made very small by choosing p to be a pilot density estimate of 
/. Preliminary or pilot density estimates are used in most adaptive kernel 
methods [53]. Second, if p = f, then the leading bias term (18) is 0. In fact, 
if / is infinitely smooth, the pointwise bias is exactly zero, as can be seen 
from 

OO 

E f [g&t)}=Y,M Lk fW> Z GC °°< 

fc=0 

where L n+l = LL n and L° is the identity operator. In addition, if a =p oc 1, 
then the bias term (18) is equivalent to the bias term (35) of the Gaussian 
kernel density estimator. Third, (20) suggests that in regions where the pilot 
density p(x) is large [which is equivalent to small diffusion coefficient o~(x)] 
and f(x) is large, the pointwise variance will be large. Conversely, in regions 
with few observations [i.e., where the diffusion coefficient o~(x) is high and 
f{x) is small] the pointwise variance is low. In other words, the ideal variance 
behavior results when the diffusivity o~(x) behaves inversely proportional to 

/(*)■ 
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4.1. Special cases of the diffusion estimator. We shall now show that 
the diffusion kernel estimator (9) is a generalization of some well-known 
modifications of the Gaussian kernel density estimator (1). Examples of 
modifications and improvements subsumed as special cases of (9) are as 
follows. 

1. If a(x) = p(x) oc 1 in (9) and SC = K, then the kernel k reduces to the 
Gaussian kernel and we obtain (1). 

2. If a(x) = 1 and p(x) = f p (x), where f p is a clipped pilot density estimate 
of / (see [1, 18, 27]), then from Lemma 1, we have 

\Jf p (s)ds 



k(x, y: t) ~ k(x, y: t) = ■ — ^LJ expi 

Thus, in the neighborhood of y such that \x — y\= 0{t"),j3 > 1/3, we 
have 

1 r (a: — y) 2 1 

y / 2irt/f p {x) { 2t/f p {x) J 

In other words, in the neighborhood of y, k is asymptotically equiva- 
lent to a Gaussian kernel with mean y and bandwidth Utj f p (y), which 
is precisely the Abramson's variable bandwidth [1] modification as ap- 
plied to the Gaussian kernel. Abramson's square root law states that the 

— 1/2 

asymptotically optimal variable bandwidth is proportional to f p (y). 

3. If we choose a(x) = p(x) = f p (x), then in an 0(t^),/3 > neighborhood of 
y, the kernel n(x,y;t) behaves asymptotically as a Gaussian kernel with 

location y + | and bandwidth s/i: 

<X ^ t] ^^t e A~^{ X - y -\m) }' 

This is precisely the data sharpening modification described in [46], where 
the locations of the data points are shifted prior to the application of 
the kernel density estimate. Thus, in our paradigm, data sharpening is 

equivalent to using the diffusion (7) with drift fx(x) = 2 j ( x ) an d diffusion 
coefficient o~(x) = 1. 

4. Finally, if we set p(x) = f p (x) and a(x) = p a (x), a £ [0, 1], then we ob- 
tain a method that is a combination of both the data sharpening and the 
variable bandwidth of Abramson. The kernel k behaves asymptotically [in 
an 0(^),/3 > 1/3 neighborhood of y] like a Gaussian kernel with location 



y + tfi(y) = y + f f^ 2 {y)f' p {y) and bandwidth y/to*(y) = ^tf^\y). 
Similar variable location and scale kernel density estimators are consid- 
ered in [27]. 
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The proposed method thus unifies many of the already existing ideas for 
variable scale and location kernel density estimators. Note that these esti- 
mators all have one common feature: they compute a pilot density estimate 
(which is an infinite-dimensional parameter) prior to the main estimation 
step. 

Our choice for a(x) will be motivated by regularity properties of the diffu- 
sion process underlying the smoothing kernel. In short, we prefer to choose 
a{x) = 1 so as to make the diffusion process in (8) nonexplosive with a 
well-defined limiting distribution. A necessary and sufficient condition for 
explosions is Feller's test [13]. 

Theorem 2 (Feller's test). Let fi(x) > and a(x) > be bounded and 
continuous. Then the diffusion process (8) explodes if and only if there exists 
z £ K such that either one of the following two conditions holds: 

1. 

dx < OO, 



2. 

poo rx 



( f y 2fi(s) \ _ 2 
exp / as a (y) dydx<oo 

\Jx o- 2 {s) ) 



A corollary of Feller's test is that when \i{x) = both of Feller's conditions 
fail, and diffusions of the form dXt = at dWt are nonexplosive. 

Since in our case we have a 2 (x) = a(x)/p(x) and a(x) = exp( 2fi(y)/ 
o~ 2 (y)dy), Feller's condition becomes the following. 

Proposition 1 (Feller's test). Given a(x) andp(x) in (7), the diffusion 
process (8) explodes if and only if there exists z£R such that either one of 
the following two conditions holds: 
1. 

p(y) 



2. 



a{x) 

p{y) 

a{x) 



dydx < oo, 
dydx < oo. 



The easiest way to ensure nonexplosiveness of the underlying diffusion 
process and the existence of a limiting distribution is to set a(x) = 1, which 
corresponds to fi{x) = 0. Note that a necessary condition for the existence 
of a limiting p.d.f. is the existence of z such that l/a(x) dx = oo. In this 
case, both of Feller's conditions fail. The nonexplosiveness property ensures 
that generation of random variables from the diffusion estimator does not 
pose any technical problems. 
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5. Bandwidth selection algorithm. Before we explain how to estimate 
the bandwidth yr* in (23) of the diffusion estimator (9), we explain how to 
estimate the bandwidth yf*i in (38) (see Appendix A) of the Gaussian kernel 
density estimator (1). Here, we present a new plug-in bandwidth selection 
procedure based on the ideas in [23, 26, 40, 48] to achieve unparalleled 
practical performance. The highlighting feature of the proposed method is 
that it does not use normal reference rules and is thus completely data- 
driven. 

It is clear from (38) in Appendix A that to compute the optimal *i for the 
Gaussian kernel density estimator (1) one needs to estimate the functional 
||/"|| 2 . Thus, we consider the problem of estimating ||/^|| 2 for an arbitrary 
integer j > I. The identity ||/^|| 2 = (-1) j Ej[/( 2j ')(X)] suggests two possible 
plug-in estimators. The first one is 

f_iy N 

(_l)iE / /(^:=i-^-V/W(X*;t i ) 



(25) 



N 

k=l 



IV N N 



AT2 

k=l m=l 

where / is the Gaussian kernel density estimator (1). The second estimator 
is 

ll/^f :=ll/ (i) (-;*)f 

N N 



(26) =^2 EE / ^H^X^tj^^X^dx 

k=l m=l J R 

r-iv' N N 

k=l m=l 

where the last line is a simplification following easily from the fact that the 
Gaussian kernel (j) satisfies the Chapman-Kolmogorov equation (16). For a 

given bandwidth, both estimators (— l) J E^/( 2 -?) and ||/^')|| 2 aim to estimate 
the same quantity, namely ||/^|| 2 - We select tj so that both estimators (25) 
and (26) are asymptotically equivalent in the mean square error sense. In 

other words, we choose tj = *tj so that both (-l) j K f f( 2 ^ and \\f^\\ 2 have 
equal asymptotic mean square error. This gives the following proposition. 

Proposition 2. The estimators (— lyEff^ and ||/^|| 2 have the same 
asymptotic mean square error when 

1 + 1/23+1/2 i x 3 x 5 x . . . x ( 2j - _ i)\ 2 /( 3 + 2 ^) 



(27) M 



r/2||/0-+D|| 2 
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Proof. The arguments are similar to the ones used in [53] . Under the as- 
sumptions that tj depends on N such that limAr_ s . 00 tj = and 

lirn/v-^oc, Nt^ +l ^ 2 = oo, we can take the expectation of the estimator (25) 



and obtain the expansion (tj = t): 



l^')(0,0;i) + ^ / 
1 x 3 x ■ • ■ x (2j - 1) 



f(x) (V 2 %) + If^ix) + o(t)\ dx + OiN- 1 ) 



1 x 3 x 5 x • • • x (2j 



^4ll/°' +1) H 2 



+ (-l)i/ (i) l| 2 + 0(iV- 1 ), AT^oo. 
Hence, the squared bias has asymptotic behavior (iV — > oo) 

((-1)%;^)] - n/wf) 2 ~ ( lx3 / + ;; 2 ^" 1) - \\\f^ 

A similar argument (see [53]) shows that the variance is of the order 0(N~ 2 x 
t~ 2 i~ l l 2 ), which is of lesser order than the squared bias. This implies that the 

leading order term in the asymptotic mean square error of E^/( 2j ) is given 
by the asymptotic squared bias. There is no need to derive the asymptotic 

expansion of Ef[||/w|| 2 ], because inspection of (26) and (25) shows that 

H/C?) || 2 exactly equals (-l) j E f f( 2 ^ when the latt er is evaluated at 2tj. In 
other words, 

V ' }UU " J (2t)'J+ l / 2 ^N 

+ t\\f ij+1) \\ 2 + 0(1 + N- 1 ). 

Again, the leading term of the asymptotic mean square error of ||/^|| 2 is 

given by the leading term of the squared bias of ||/^^|| 2 - Thus, equalizing the 
asymptotic mean squared error of both estimators is the same as equalizing 
their respective asymptotic squared biases. This yields the equation 

• lx3x-x(2j-l) , ||f(j+ i)„2 X2 

{2ty+ l / 2 V 7 ^N " 
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1 x 3 x • • • x (2j - 1) t 



^\\f U+1] \\ 



2 



The positive solution of the equation yields the desired *tj. □ 

Thus, for example, 
(28, A ' 8 + ^ 3 ^ 



24 JVvV72||/< 3 )|| : 

is our bandwidth choice for the estimation of ||/"|| 2 . We estimate each *tj 
by 

(29) „£• - ' 1 + 1/23+1,2 1 x 3 x 5 x ••• x (2j - 1) N 



6 Af^72|l/ (i+1) ll 2 

Computation of ||/^ +1 ^|| 2 requires estimation of *tj+i itself, which in turn 
requires estimation of *tj+2, an d so on, as seen from formulas (26) and 
(29). We are faced with the problem of estimating the infinite sequence 
{*tj + fc,A; > 1}. It is clear, however, that given for some I > we can 

estimate all {*tj, 1 < j < 1} recursively, and then estimate *t itself from (38). 
This motivates the l-stage direct plug-in bandwidth selector [26, 48, 53], 
defined as follows. 

1. For a given integer I > 0, estimate via (27) and ||/ (z+2) || 2 computed 
by assuming that / is a normal density with mean and variance estimated 
from the data. Denote the estimate by 

2. Use to estimate ||/^ +1 ^|| 2 via the plug-in estimator (26) and *£/ via 
(29). Then use *ti to estimate and so on until we obtain an estimate 
of *t2- 

3. Use the estimate of *i2 to compute *t from (38). 

The /-stage direct plug-in bandwidth selector thus involves the estimation 
of I functionals {||/^')||,2 < j ; < I + 1} via the plug-in estimator (26). We 
can describe the procedure in a more abstract way as follows. Denote the 
functional dependence of *tj on in formula (29) as 

It is then clear that Jj = 7j(7j+i(**7+2)) = 7i(7j'+i(7i+2(**j+3))) = • • • • For 
simplicity of notation, we define the composition 

7 W( t ) = 7l (... 7fe „ 1 ( 7fc (t))...), k >l. 

k times 
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Inspection of formulas (29) and (38) shows that the estimate of *i satisfies 
J = e.ti = £7 [11 (**2) = £7 [2] (*4) = • • • = 

e= f 7 j ~o.9o. 

Then, for a given integer Z > 0, the Z-stage direct plug-in bandwidth selector 
consists of computing 

J=fr [l] (*t l+1 ), 

where is estimated via (27) by assuming that / in ||/^ +2 ^|| 2 is a normal 
density with mean and variance estimated from the data. The weakest point 
of this procedure is that we assume that the true / is a Gaussian density 
in order to compute ||/^ +2 ^|| 2 . This assumption can lead to arbitrarily bad 
estimates of *i, when, for example, the true / is far from being Gaussian. 
Instead, we propose to find a solution to the nonlinear equation 

(30) * = f7 M (<), 

for some I, using either fixed point iteration or Newton's method with initial 
guess t = 0. The fixed point iteration version is formalized in the following 
algorithm. 

Algorithm 1 (Improved Sheather-Jones). Given I > 2, execute the fol- 
lowing steps: 

1. initialize with zq = e, where e is machine precision, and n = 0; 

2. set z n+ i =£jV\(zn); 

3. if \z n+ i — z n \ < e, stop and set *t = z n+ \; otherwise, set n := n + 1 and 
repeat from step 2; 

4. deliver the Gaussian kernel density estimator (1) evaluated at *t as the 
final estimator of /, and S2 = 7^~ 1 H z «+i) as the bandwidth for the op- 
timal estimation of ||/"|| 2 . 

Numerical experience suggests the following. First, the fixed-point algo- 
rithm does not fail to find a root of the equation t = £7^(i). Second, the 
root appears to be unique. Third, the solutions to the equations 

and 

t = Zi [l+5] (t) 

for any I > do not differ in any practically meaningful way. In other words, 
there were no gains to be had by increasing the stages of the bandwidth 
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Fig. 3. The Improved Sheather-Jones bandwidth selection rule in Algorithm 1 leads to 
improved performance compared to the original plug-in rule that uses the normal reference 
rule. 

selection rule beyond 1 = 5. We recommend setting / = 5. Finally, the nu- 
merical procedure for the computation of 7^(i) is fast when implemented 
using the Discrete Cosine Transform [4]. 

The plug-in method described in Algorithm 1 has superior practical per- 
formance compared to existing plug-in implementations, including the par- 
ticular solve-the- equation rule of Sheather and Jones [48, 53]. Since we bor- 
row many of the fruitful ideas described in [48] (which in turn build upon 
the work of Hall, Park and Marron [17, 45]), we call our new algorithm the 
Improved Sheather-Jones (ISJ) method. 

To illustrate the significant improvement of the plug-in method in Algo- 
rithm 1, consider, for example, the case where / is a mixture of two Gaussian 
densities with a common variance of 1 and means of —30 and 30. 

Figure 3 shows the right mode of /, and the two estimates resulting from 
the old plug-in rule [48] and the plug-in rule of Algorithm 1 . The left mode is 
not displayed, but looks similar. The integrated squared error using the new 
plug-in bandwidth estimate, ||/ — /(-;*t)|| 2 , is one 10th of the error using 
the old bandwidth selection rule. 

5.1. Experiments with normal reference rules. The result of Figure 3 is 
not an isolated case, in which the normal reference rules do not perform 
well. We performed a comprehensive simulation study in order to compare 
the Improved Sheather-Jones (ISJ) (Algorithm 1) with the original (vanilla) 
Sheather-Jones (SJ) algorithm [48, 53]. 
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Table 1 

Results over 10 independent simulation experiments. In all cases the domain was 
assumed to be R. Many test problems are taken from [42]. In the table N(fj,,a 2 ), denotes 
a Gaussian density with mean fj, and variance a 2 



Case 




Target density f(x) 


N 


Ratio 


1 (claw) 




iN(0,l) + £t=oraN(!-l>(ro) 2 ) 


10 :i 
10 4 


0.72 
0.94 


2 (strongly skewed) 




ELo|N(3((|) fc -l),(|) 2fc ) 


10 3 
10 4 


0.69 
0.84 


3 (kurtotic unimodal) 




|N(0,l) + iN(0,(i) 2 ) 


10 2 
10 3 


0.78 
0.93 


4 (double claw) 




^N(-l,(|) a ) + ^N(l,(|) a ) 


10 5 


0.35 






~ 350 Z^fe=0 lv V 2 »V100/ > 


10 6 


0.10 


5 (discrete comb) 


2 sr 2 

7 Z-,k= 


^ N(M,(f) 2 ) + iE^ 8 N(f,(^) 2 ) 


10 3 
10 4 


0.45 
0.27 


6 (asymmetric 


46 v-il 
100 2-^k- 


. N(2fc-l,(|) 2 ) + ELi3SoN(-|,( T i,) 2 ) 


10 4 


0.68 


double claw) 


+ Efc=l 300^(1' ( 100 ) ) 




U.Z4 


7 (outlier) 




iN(0,l) + |N(0,(lf) 


10 3 
i n 6 


1.01 
i on 


8 (separated bimodal) 




lN(-12,i) + iN(12,i) 


10 2 
10 3 


0.33 
0.64 


9 (skewed bimodal) 




|N(0,l) + iN(|,(i) 2 ) 


10 3 
10 4 


1.02 
1.00 


10 (bimodal) 




iN(0,(i) 2 ) + iN(5,l) 


10 2 
10 3 


0.31 
0.70 


11 




Log-Normal with [i = and a = 1 


10 3 
10 4 


0.82 
0.80 


12 (asymmetric claw) 


iN(0,l)+EL^N(fc + l,(^) a ) 


10 3 
10 4 


0.76 
0.59 


13 (trimodal) 




|ELo N ( 80fc ;( fc + 1 ) 4 ) 


10 2 
10 3 


0.21 
0.17 


14 (5-modes) 




iEt=o N (80fc;(fc + i) 2 ) 


10 3 
10 4 


0.07 
0.18 


15 (10-modes) 




roELoNCioofcKfe + i) 2 ) 


10 3 
10 4 


0.12 
0.07 


16 (smooth comb) 




V^5 2 5 - fc mi- 65-96/2 fc (32/63) 2 i 
l^k=0 63 ™\ 21 ' 2 2fe > 


10 4 
10 5 


0.40 
0.34 
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Table 1 shows the average results over 10 independent trials for a number 
of different test cases. The second column displays the target density and 
the third column shows the sample size used for the experiments. The last 
column shows our criterion for comparison: 

R _ ll/(-;**)-/ll 2 
ll/(-;isj)-/ll 2 ' 

that is, the ratio of the integrated squared error of the new ISJ estimator 
to the integrated squared error of the original SJ estimator. Here, tsj is the 
bandwidth computed using the original Sheather- Jones method [48, 53]. 

The results in Table 1 show that the improvement in the integrated 
squared error can be as much as ten-fold, and the ISJ method outperforms 
the SJ method in almost all cases. The evidence suggests that discarding 
the normal reference rules, widely employed by most plug-in rules, can sig- 
nificantly improve the performance of the plug-in methods. 

The multi-modal test cases 12 through 16 in Table 1 and Figure 3 demon- 
strate that the new bandwidth selection procedure passes the bi-modality test 
[10], which consists of testing the performance of a bandwidth selection pro- 
cedure using a bimodal target density, with the two modes at some distance 
from each other. It has been demonstrated in [10] that, by separating the 
modes of the target density enough, existing plug-in selection procedures 
can be made to perform arbitrarily poorly due to the adverse effects of the 
normal reference rules. The proposed plug-in method in Algorithm 1 per- 
forms much better than existing plug-in rules, because it uses the theoretical 
ideas developed in [48], except for the detrimental normal reference rules. 
A Matlab implementation of Algorithm 1 is freely available from [4], and 
includes other examples of improved performance. 

Algorithm 1 can be extended to bandwidth selection in higher dimensions. 
For completeness we describe the two-dimensional version of the algorithm 
in Appendix E. The advantages of discarding the normal reference rules 
persist in the two-dimensional case. In other words, the good performance 
of the proposed method in two dimensions is similar to that observed in the 
univariate case. For example, Figure 4 shows the superior performance of the 
ISJ method compared to a plug-in approach using the normal reference rule 
[52, 53], and with kernels assumed to have a diagonal covariance matrix with 
a single smoothing parameter: S = tl. We estimate the bivariate density, 
I Sfc=i N(/*j b , -Z"), from a sample of size N = 400, where 

/ii = (0,0), /x 2 = (0,50), ^ = (50,0), a* 4 = (50,50). 

Note that using a plug-in rule with a normal reference rule causes significant 
over-smoothing. The integrated squared error for the ISJ method is 10 times 
smaller than the corresponding error for the plug-in rule that uses a normal 
reference rule [52, 53]. 
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Fig. 4. Right panel: plug-in rule with normal reference rule; left panel: the Improved 
S heather- J ones method; the normal reference rule causes significant over- smoothing. 



5.2. Bandwidth selection for the diffusion estimator. We now discuss the 
bandwidth choice for the diffusion estimator (9). In the following argument 
we assume that / is as many times continuously differentiable as needed. 
Computation of t* in (23) requires an estimate of ||i/|| 2 and Kf[a~ 1 (X)]. 
We estimate E/fcr^pf)] via the unbiased estimator Yli=i a The 
identity ||£/|| 2 = 'EfL*Lf(X) suggests two possible plug-in estimators. The 
first one is 

i N 

E f L*Lf:=-^2L*Lg(x;t 2 ) 



(31) 



3=1 
N N 

-2^^L*L K (x,X 4 ;t 2 ) 



N 2 



x=Xi 



i=l j=l 

where g{x\t2) is the diffusion estimator (9) evaluated at t 2 , and 3£ 
second estimator is 

Wf -.= \\Lg{--t 2 )f 



The 



(32) 



dt 



(•;t 2 ) 



N N 



7Y2 X^^Z 

i=l j=l 
N N 

jpJ2J2 L * L < x > x " 2t ^ 



— {x,X i ;t2)-^{x,X j ;t2)dx 



x=Xi 



i=l j=l 

where the last line is a simplification that follows from the Chapman- 
Kolmogorov equation (16). The optimal t\ is derived in the same way that 
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*t2 is derived for the Gaussian kernel density estimator. That is, t?j is such 

that both estimators EfL*Lf and ||^/|| 2 have the same asymptotic mean 
square error. This leads to the following proposition. 

Proposition 3. The estimators EjL*Lf and \\Lf\\ 2 have the same 
asymptotic mean square error when 



(33) 



8 + V2 -3V2E f [g- l (X)} \ 2/7 
24 8^NE f [L*L 2 f{X)]J ' 



Proof. Although the relevant calculations are lengthier, the arguments 
here are exactly the same as the ones used in Proposition 1. In particular, we 
have the same assumptions on t about its dependence on N. For simplicity 
of notation, the operators L* and L are here assumed to apply to the first 
argument of the kernel k: 

E f [EjI?Lf} 

N N 



N 2 

i=i j=i 



1 

N 



f{x)L*LK(x,X i; t) 



dx 

Xi=x 



8^t 5 / 2 N V ' 



+ J J f(y)f(x)L*LK(x,y;t)dydx + 0(N- 1 ) 

3 ^i 2 N )] + / f{y) I vm*M*,v;t)*cdy 

+ 0(N- 1 (l + t- 3 / 2 )) 
3V2E f [a-^X)} 



8^t 5 / 2 N 
+ 0{N- 1 {l + t- 3/2 )+t 2 ), 
where we have used a consequence of Lemma 1 



+ \\Lf\\ 2 +t j f(y)L*L 2 f(y)dy 



f(x)L*LK(x,Xi;t) 



3V2E f [a-HX)] 
dx ~ -= = ,„ , tlO, 

X i= x 8^5/2 
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L*Lf(x)K(x, y; t) dx = I ^ - K(y,x;t) dx 



and a consequence of the detailed balance equation (15) 

)L*L 

p{y) 

= L*Lf(y) + tL*L*Lf(y) + 0(t 2 ). 
Therefore, the squared bias has asymptotic behavior (N — > oo) 

CMbTl^/]- \\Lfff- ( 3V ^/ 2 ^ )] +* / f(y)L*L 2 f(y)dy) 2 - 

Since estimator ||L/|| 2 equals KfL*Lf when the latter is evaluated at 2t2, 

the asymptotic squared bias of ||^/|| 2 follows immediately, and we simply 
repeat the arguments in the proof of Proposition 1 to obtain the desired t* 2 . 
□ 

Note that t 2 has the same rate of convergence to as *t2 in (28). In 
fact, since the Gaussian kernel density estimator is a special case of the 
diffusion estimator (9) when p(x) = a(x) = 1, the plug-in estimator (32) for 
the estimation of ||-L/|| 2 reduces to the plug- in estimator for the estimation 
°f ill/"l| 2 - I n addition, when p(x) = a(x) = 1, the iJj in (33) and hi 
(28) are identical. We thus suggest the following bandwidth selection and 
estimation procedure for the diffusion estimator (9). 



Algorithm 2. 

1. Given the data X\, . . . ,Xn, run Algorithm 1 to obtain the Gaussian 
kernel density estimator (1) evaluated at *t and the optimal bandwidth 
y/~*t2 f° r the estimation of ||/"|| 2 . This is the pilot estimation step. 

2. Let p(x) be the Gaussian kernel density estimator from step 1, and let 
a(x) =p a (x) for some a £ [0, 1]. 

3. Estimate ||-L/|| 2 via the plug-in estimator (32) using t% = *t 2: where *i2 
is computed in step 1. 

4. Substitute the estimate of ||i/|| 2 into (23) to obtain an estimate for t*. 

5. Deliver the diffusion estimator (9) evaluated at t* as the final density 
estimate. 



The bandwidth selection rule that we use for the diffusion estimator in 
Algorithm 2 is a single stage direct plug-in bandwidth selector, where the 
bandwidth t 2 for the estimation of the functional ||-^/|| 2 is approximated 
by *t2 (which is computed in Algorithm 1), instead of being derived from a 
normal reference rule. In the next section, we illustrate the performance of 
Algorithm 2 using some well-known test cases for density estimation. 
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Remark 5 (Random variable generation). For applications of kernel 
density estimation, such as the smoothed bootstrap, efficient random vari- 
able generation from the diffusion estimator (9) is accomplished via the Euler 
method as applied to the stochastic differential equation (8) (see [34]). 

Algorithm 3. 

1. Subdivide the interval [0,t*] into n equal intervals of length 5t = t* jn for 
some large n. 

2. Generate a random integer / from 1 to N uniformly. 

3. For i = 1, ... ,n, repeat 

Y t = Yi- X + ntYi-^St + aiYi^VdtZi, 

where Z\, . . . , Z n ~u.d. N(0, 1), and Yq = Xj. 

4. Output Y n as a random variable with approximate density (9). 

Note that since we are only interested in the approximation of the sta- 
tistical properties of Y n , there are no gains to be had from using the more 
complex Milstein stochastic integration procedure [34]. 

6. Numerical experiments. In this section, we provide a simulation study 
of the diffusion estimator. In implementing Algorithm 2, there are a num- 
ber of issues to consider. First, the numerical solution of the PDE (7) is 
a straightforward application of either finite difference or spectral methods 
[36]. A Matlab implementation using finite differences and the stiff ODE 
solver ode 15s. m is available from the first author upon request. Second, we 
compute ||L<7(-;£2)|| 2 m Algorithm 2 using the approximation 

2 

^\\g(-,t + e)-g(-,t)\\ 2 /£ 2 , £-<l, 

where g ( • ; t ) and g(-;t + e) are the successive output of the numerical integra- 
tion routine (odel5s .m in our case). Finally, we selected a = 1 or a(x) = p(x) 
in Algorithm 2 without using any clipping of the pilot estimate. For a small 
simulation study with a = 0, see [5]. 

We would like to point out that simulation studies of existing variable- 
location scale estimators [27, 46, 51] are implemented assuming that the 
target p.d.f. / and any functionals of / are known exactly and no pilot 
estimation step is employed. In addition, in these simulation studies the 
bandwidth is chosen so that it is the global minimizer of the exact MISE. 
Since in practical applications the MISE and all functionals of / are not 
available, but have to be estimated, we proceed differently in our simula- 
tion study. We compare the estimator of Algorithm 2 with the Abramson's 
popular adaptive kernel density estimator [1]. The parameters *t and ^2 



\\Lg(-,t)f 



dg 
Ot 



(•;*) 
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of the diffusion estimator are estimated using the new bandwidth selection 
procedure in Algorithm 1. The implementation of Abramson's estimator in 
the Stata language is given in [33]. Briefly, the estimator is given by 



where A? = G/f(Xi;t p ), G= (n£=i f{Xi\tp)) 1,N , and the bandwidths yft 
and *Jtp are computed using Least Squares Cross Validation (LSCV) [38]. 
Our criterion for the comparison is the numerical approximation to 

Ratio- W:h-ff 



II/a - /IP 

that is, the ratio of the integrated squared error of the diffusion estimator 
to the integrated squared error of the alternative kernel density estimator. 

Table 2 

Results over 10 independent simulation experiments. In all cases the domain was 

assumed to be K 



Case 


Target density f(x) 


N 


Ratio I 


Ratio II 


1 


iN(0,(i) 2 ) + |N(5,l) 


10 3 
10" 


0.9 
0.23 


0.82 
0.48 


2 


|N(0,l)+E 4 fc =oT5 N (|-b(^) 2 ) 


10 3 
3 x 10 5 


0.65 
0.11 


0.99 
0.51 


3 


ELoiN(3((|) fc -l),(|) 2fc ) 


10 3 
10 5 


1.05 
0.15 


0.75 
0.45 


4 


TfN(-l,(|) 2 ) + T iN(l,(|) 2 ) + 3i IJ ELoN(^,( T io) 2 ) 


10 3 
10 5 


0.94 
0.46 


0.63 
0.76 


5 


2 V^ 2 M(-12fc-15 t2\2\ | 1 v^ 10 Mf2fc ( 1 \2\ 
7 2~ik=0'^\ 7 'WJ 1^ 21 Z-ik=% n \ 7 ' V 21 I I 


10 3 
10 5 


0.54 
0.12 


2.24 
0.84 


6 


it EU N(2fc - 1, (|) 2 ) + ELi 550 N (-i (loo) 2 ) 


10 4 


0.83 


0.93 




~ Z^fe=l 300 '^V 2 ' V 100 J ^ 


10 5 


0.55 


0.68 


7 


iN(-2,i) + iN(2,i) 


10 3 
10 5 


0.51 
0.41 


0.51 
0.89 


8 


|N(0,l) + iN(|,(I) 2 ) 


10 3 
10 6 


0.59 
0.79 


0.53 
1.01 


9 


Log-Normal with /i = and u = 1 


10 3 


0.17 


0.85 






10 5 


0.12 


0.51 


10 


|N(0,l) + EL- 2 ^N(fc+i,(^) 2 ) 


10 3 
10 4 


0.88 
0.30 


0.98 
0.85 
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Table 3 

Practical performance of the boundary bias correction of the diffusion estimator for the 
test cases: (1) exponential distribution with mean equal to unity; (2) test cases 1 through 
8, truncated to the interval (— oo,0] 



Test case 


Exp(l) 


1 


2 


3 


4 


5 


6 


7 


8 


Ratio 


0.52 


0.38 


0.74 


0.25 


0.70 


0.38 


0.74 


0.56 


0.46 



Table 2, column 4 (ratio I) shows the average results over 10 independent 
trials for a number of different test cases. The second column displays the 
target density and the third column shows the sample size used for the 
experiments. In the table N(/x,cr 2 ), denotes a Gaussian density with mean 
\jl and variance <r~ . Most test problems are taken from [42]. For each test 
case, we conducted a simulation run with both a relatively small sample size 
and a relatively large sample size wherever possible. The table shows that, 
unlike the standard variable location-scale estimators [27, 51], the diffusion 
estimator does not require any clipping procedures in order to retain its 
good performance for large sample sizes. 

Next, we compare the practical performance of the proposed diffusion es- 
timator with the performance of higher-order kernel estimators. We consider 
the sine kernel estimator defined as 

t , s 1 1 rrfx — X{\ j . sin(x) 

where again \ft is selected using LSCV. Table 2, column 5 (ratio II) shows 
that the results are broadly similar and our method is favored in all cases ex- 
cept test case 5. Higher-order kernels do not yield proper density estimators, 
because the kernels take on negative values. Thus, an important advantage 
of our method and all second order kernel methods is that they provide 
nonnegative density estimators. As pointed out in [53], the good asymptotic 
performance of higher-order kernels is not guaranteed to carry over to finite 
sample sizes in practice. Our results confirm this observation. 

In addition, we make a comparison with the novel polynomial boundary 
correction method of Hall and Park [20]. The results are given in Table 3, 
where we use some of the test cases defined in Table 1, truncated to the 
interval (— oo, 0]. Table 3 shows that for finite sample sizes the practical per- 
formance of our approach is competitive. We now give the implementation 
details. Let f3 be the point of truncation from above, which is assumed to 
be known in advance. Then, the Hall and Park estimator is 

(34) /«(*;*) = — , j:4 X ~ Xl r a{X) \ *<t, 
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where a(x) = t fy^- p(^-r^); fo(x) is equivalent to f a {x) when a(x) = 0, and 

/o(x) is an estimator of f'(x); p(u) = v<f>(v) dv. We use LSCV to 

select a suitable bandwidth \/t. The denominator in (34) adjusts for the 
deficit of probability mass in the neighborhood of the end-point, but note 
that theoretically (34) does not integrate to unity and therefore random 
variable generation from (34) is not straightforward. In addition, our esti- 
mator more easily handles the case with two end-points. On the positive 
side, Hall and Park [20] note that their estimator preserves positivity and 
has excellent asymptotic properties, which is an advantage over many other 
boundary kernels. 

Finally, we give a two-dimensional density estimation example, which to 
the best of our knowledge cannot be handled satisfactorily by existing meth- 
ods [19, 31] due to the boundary bias effects. The two-dimensional version 
of equation (2) is 

f(x;t)4(g(x;t) + g ( x;t)) V*>0,*6*, 

/(x;0) = A(x), 

n-V/(x;f) = Vt>0, 

where x = (xi,X2) belongs to the set X C R 2 , the initial condition A(x) is 
the empirical density of the data, and in the Neumann boundary condition 
n denotes the unit outward normal to the boundary d2£ at x. The partic- 
ular example which we consider is the density estimation of 600 uniformly 
distributed points on the domain = {x:x 2 + (4x2) 2 < 4}. We assume 




Fig. 5. A two-dimensional example with 600 points generated uniformly within an ellipse. 
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that the domain of the data 2£ is known prior to the estimation. Figure 5 
shows /(x;i*) on X = {x:xf + (4x2) 2 < 4}, that is, it shows the numeri- 
cal solution of the two-dimensional PDE at time t* = 0.13 on the set X . 
The bandwidth was determined using the bandwidth selection procedure 
described in Appendix E. We emphasize the satisfactory way in which the 
p.d.f. /(x;t*) handles any boundary bias problems. It appears that cur- 
rently existing methods [19, 22, 31, 32] cannot handle such two-dimensional 
(boundary) density estimation problems either because the geometry of the 
set X is too complex, or because the resulting estimator is not a bona-fide 
p.d.f. 



7. Conclusions and future research. We have presented a new kernel 
density estimator based on a linear diffusion process. The key idea is to con- 
struct an adaptive kernel by considering the most general linear diffusion 
with its stationary density equal to a pilot density estimate. The result- 
ing diffusion estimator unifies many of the existing ideas about adaptive 
smoothing. In addition, the estimator is consistent at boundaries. Numeri- 
cal experiments suggest good practical performance. As future research, the 
proposed estimator can be extended in a number of ways. First, we can con- 
struct kernel density estimators based on Levy processes, which will have 
the diffusion estimator as a special case. The kernels constructed via a Levy 
process could be tailored for data for which smoothing with the Gaussian 
kernel density estimator or diffusion estimator is not optimal. Such cases 
arise when the data is a sample from a heavy-tailed distribution. Second, 
more subtle and interesting smoothing models can be constructed by con- 
sidering nonlinear parabolic PDEs. One such candidate is the quasilinear 
parabolic PDE with diffusivity that depends on the density exponentially: 

*) I^)), «>o. 

Another viable model is the semilinear parabolic PDE 

where u(x;t) =log(g(x;t)) is the logarithm of the density estimator. The 
Cauchy density n ( x $ +t Z} is a particular solution and thus the model could 
be useful for smoothing heavy-tailed data. All such nonlinear models will 
provide adaptive smoothing without the need for a pilot run, but at the cost 
of increased model complexity. 
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APPENDIX A: GAUSSIAN KERNEL DENSITY ESTIMATOR 

PROPERTIES 

In this appendix, we present the technical details for the proofs of the 
properties of the diffusion estimator. In addition, we include a description 
of our plug-in rule in two dimensions. 

We use II • II to denote the Euclidean norm on R. 



Theorem 3. Let t = t^ be such that lirnjv->-oc tjv = and limjv-»oo iVyr/v 
oo. Assume that f" is a continuous square-integrable function. The inte- 
grated squared bias and integrated variance of the Gaussian kernel density 
estimator (1) have asymptotic behavior 



(35) \\E f [f(-,t)]-fr =in\rr+o(n N 



oo, 



and 



oo, 



(36) j V aif [f( X] t)]dx = ^= + ((NV~ty 1 ), N- 

respectively. The first-order asymptotic approximation of MISE, denoted 
AMISE, is thus given by 

(37) AMISE{/}(t) = ^ 2 ||r|| 2 + 



4 2N^/rt' 
The asymptotically optimal value of t is the minimizer of the AMISE 

1 \ 2/5 



(38) J 
giving the minimum value 



2N^\\f"\ 

5||.f"|l 2 / 5 



(39) AMISE{/}(^) = iV- 4 / 5 47/57r2/5 . 

For a simple proof, see [53]. 

APPENDIX B: PROOF OF LEMMA 1 

We seek to establish the behavior of the solution of (11) and (10) as 1 1 0. 
We use the Wentzel-Kramers-Brillouin- Jeffreys (WKBJ) method described 
in [2, 8, 29, 43]. In the WKBJ method, we look for an asymptotic expansion 
of the form 

oo 

(40) KCx^tJ-e-V^)- 8 ^) J> m - ^ 2 C m (x,y), t±0, 

where {C m (x,y)} and s(x,y) are unknown functions. To determine s(x,y) 
and {C m (x,y)} , we substitute the expansion into (10) and, after canceling 



KERNEL DENSITY ESTIMATION VIA DIFFUSION 31 

the exponential term, equate coefficients of like powers of t. This matching 
of the powers of t leads to solvable ODEs, which determine the unknown 
functions. Eliminating the leading order 0(i -5 / 2 ) term gives the ODE for s 



(41) a(x) 



d 1 2 

Tx s{x,y) 



■ p(x) = 0. 



Setting the next highest order 0(t 3 / 2 ) term in the expansion to zero gives 
the ODE 

= 2a(x)s(x, y) — —p(x)C (x, y) - 2a(x)s(x, y) ^-p 2 {x) — — 
ox ax ox ox 

+ p 3 (x)C {x, y) + s 2 (x, y)p 3 (x)C 1 (x, y) 
(42) -^p 2 (x) S (x,y)^C (x,y) 

r\ \ 2 / r\ \ 2 

® S \ 2/ \n / ( ® s \ 2/ 



+ a(x)s ( X ^y)y-Q^j P ( x )Ci{x,y) - a ( x )[j^;j P ( x ) c o(x,y) 

- a(x)s(x, y) ^p 2 (x)C {x, y). 

To determine a unique solution to (41), we impose the condition s(x,x) = 0, 
which is necessary, but not sufficient, to ensure that lim^ n(x, y; t) = 5(x — 
y). This gives the solution 



Substituting this solution into (42) and simplifying gives an equation without 
Ci(x,y), 

(43) C (x,y)p(x) ^+Aa(x)p(x) ^ - 3C (x,y) ^a(x) = 0, 

whence we have the general solution Cq(x, y) = h(y)p 3 ^ i (x)a~ 1 ^ 4: (x) for some 
as yet unknown function of y, h(y). To determine h(y), we require that 
the kernel K,(x,y;t) satisfies the detailed balance equation (15). This en- 
sures that K(x,y;t) also satisfies (11). It follows that Co(x,y) has to satisfy 
p(y)Co(x,y) = p(x)Co(y,x), which after rearranging gives 



h(x)(a(x)p(x)) 1/i = h{y)(a(y)p(y)) 



1/4 



A separation of variables argument now gives h(y)(a(y)p(y)) 1 ^ 4: = const., 
and hence 

Co(x,y) = con S t.(a(y)p(y))~ 1/ V / H x )a- 1/ \x). 
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We still need to determine the arbitrary constant. The constant is chosen so 
that 

I'OO 



/oo 
k(x, v; t) dx = 1, 
-oo 



which ensures that lining n(x,y;t) = 5{x — y). This final condition yields 

p(x) 



C (x,y) 



27r(a(y)p(y)a(x)p(x)y/^ 



and hence 
K(x,y;t) 



P{x) ^ j 1_ 

f7 M[p{x)a{x)a{y)p{y)} 1 / i ^\ 2t 



lp(s) 



y V °( s ) 



ds 



Remark 6. Matching higher powers of t gives first order linear ODEs 
for the rest of the unknown functions {C m (x,y),m > 1}. The ODE for each 
C m (x,y),m = 1,2,3, .. . is 



as'(C m /pY 



(as')' 
2p 



+ (m - 1/2) C m = (a(C m _i/p)')', C m (y, y) = 



where all derivatives apply to the variable x and y is treated as a constant. 
Thus, in principle, all functions {C m (x,y)} can be uniquely determined. 

It can be shown (see [8]) that the expansion (40) is valid under the 
conditions that a,p and all their derivatives are bounded from above, and 
p(x) > po > 0, a(x) > ao > 0. Here, we only establish the validity of the lead- 
ing order approximation k under the milder conditions (17). We do not 
attempt to prove the validity of the higher order terms in (40) under the 
weaker conditions. The proof of the following lemma uses arguments similar 
to the ones given in [8]. 

Lemma 2. Let a(x) and p(x) satisfy conditions (17). Then, for all t£ 
(0,to\, where to > is some constant independent of x and y, there holds 

\K(x,y;t)-K(x,y;t)\ < const. C (x, y)t 1/4 e" s2( ^ )/(2t) Vx,y. 

To prove the lemma, we first begin by proving the following auxiliary 
results. 



Proposition 4. Define 



i( Z ) = i( Z] x,y,t,r) = im + s2 ^ y) 



2(t-r) 



2t 
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Then for r G (0, t), we have 

Hz) > i^y). 

Moreover, there exists a unique zq = Zo(x,y,t,r) for which £(zq) = s , 
and £(z) is increasing for z > zq and decreasing for z < zq. 



Proof. We have 



l(z) 



i_(£ a - 1(s)rfs ) 2 + j_(r ff - 1(s)rfs ) 2 , 



2(t 
and hence 

(44) /(z)= [ X a -\ s ) ds + ^A f a-\ s )ds. 



t-r J z T 

For x 7^ y, £'{y) > 0,£'(x) < 0, and therefore by the continuity of £' , there 
exists zq £ (x, y) : £'(zq) = 0. For x = y, set zq = x. Setting z = zq in (44), 

(45) — — / a- 1 {s)ds = - a- 1 {s)ds. 

t — TJzo T Jy 

Therefore, /^o" _1 (s) ds = ^ J* °a~ 1 (s) ds and adding jy°a~ 1 (s) ds to both 
sides we obtain 



/ a~ 1 (s)ds = — / a~ 1 (s)ds, 

Jy 7~ Jy 



from which we see that (45) is also equal to jfy°~ l (s)ds. Hence, by sub- 
stitution £(zq) = ^{fy a~ 1 (s) ds) 2 , as required. Finally, note that if F{z) = 
£{z) - 277^tj(£ o-^'is) ds) 2 , then F'(z) = for all z. Hence, F{z) = F{z ) = 
£(zq) and 

(46) £(z)=£(z ) + 2r(f f _ r) (jfV^cfe) . 

As a consequence of Proposition 4, we have the following result. □ 

Proposition 5. Assuming lim z ^.± O0 J a~ 1 (s) ds = ±00, we have the 
following equality: 



1 I f°° / e s 2 (x,z)/(2(t-r)) e -s*{z,y)/(2T)\2 

dzdr 



= 2 ^-l/4 t l/4 r 2 ( 3 /4)e -^(x, y )/( 2t ) 

= c 2 t 1 ' i e- s2 ^^ 2t \ 
where c 2 is a constant [indeed c 2 = 27r~ 1//4 r 2 (3/4)/. 
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Proof. We have 

e -s 2 {x,z)/(, . , e 



oo -s 2 (x,z)/(t-T) -s 2 (z,y)/i 



t — T a{z)T 



dz 



(t-r)r 



1 



oo 



r r 



with the change of variable viz) = , 1 f z a 1 (s)ds. Then the result 

follows from the fact that f*(r(t - r))~ 1/4 dr = 2vr^ 1 / 2 t 1 / 2 r 2 (3/4). 

Given these two auxiliary results, we proceed with the proof of Lemma 2. 
Writing 

Q _ _ e -* 2 (^,J/)/(2i) 
K*(x,y;t) = —K(x,y;t) - LH{x,y;t) = LC (x,y), 

we define inductively the following sequence of function {pj}, starting with 
P0=0: 

rt roc 

Pj+i(x,y,t) = -K*(x,y;t)- / k* (x, z;t - T)pj(z,y;r) dz dr, 

JO J-oo 

J = 1,2,.... 

Note in particular that p\ = —k*. We will show that there exists a limit of 
{pj}- We begin by proving via induction that for j > 1, x, y G R, t G (0, to], 
where 



to = uiin 

there holds 



2 C1C2 ; '7' 



(47) ^(^^^-^(^^^l^^lLCo^y)!* 1 /^ (a:,y)/(2t) , 



where C3 = 2ciC2/v27r. First, we calculate for j = 1 

/■oo 

p 2 (x,y,t) = -K*(x,y,t) + / K*(x,z,t - t)k* (z,y,r) dzdr. 

JO J-oo 

Therefore, we have the following bound: 

|p 2 (z,y,t) -pi(x,y,t)| 

rt poo 

< / |K*(x,z,t — r)K*(z,y,r)\ dzdr 
Jo J-oo 
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t poo e -s 2 (x,z)/(2(t-r)) e -s 2 (z,y)/(2r) 

j== 7= \LC (x,z)LC (z,y)\ dz dr 

3 J-oo VI- T V r 

t f oo e -s 2 (x,z)/(2(t-T)) e -s 2 (z,y)/(2r) 




V't — T v'a(z)T 



x ^a(zj\LCo(x,y)\ ' , , ^ u dzdr 



\Lq(z) 



2^{a{z)p{z)Y^ 

1 /■* f 00 e -s 2 (^^)/(2(t-r)) 

|^C (x,y)| 



27T ' Jo J-oo y/t — T 



e -s 2 (z,y)/(2T) S L /^| 

x — J — t 1 -^ dr 



V Z7T 



where the last inequality follows from the Cauchy-Schwarz inequality, Propo- 
sition 5 and assumption (17). We thus have 

|^(x,y,t)-p 1 (x,y,t)|<||LCo(x,y)|t 1 / 4 e- s2 ^)/( 2t ). 

Next, assume the induction statement is true for 2,3, ... ,j — 1. Then 
\p j+ i(x,y,t) - pj(x,y,t)\ 

f't f'OO 

< \K*(x,z,t-T)\\pj(z,y,T) - pj-i(z,y,T)\dzdr 



< 



J-oo 

o e -s 2 (x,z)/(2(t-r)) 




'0 J-oo 

.2 



-^== \LC (x,z)\-^\LC (z,y)\ 



XT ^ e s^ y )/{2r) dzdT 



rt ,oo e -s 2 (x,z)/(2(t-r)) e -s 2 (z,y)/(2r) 




T 



<^ T |XCo(x, 2/ )|t 1 /4 e - S 2 (^)/(2 t ) t 3/4 



XT ^m^Ldzdr 

ClC 2 



The last line follows from the Cauchy-Schwarz inequality and the fact that 
r 3/4 < t 3/4 < ^3/4 Since tl /4 ^ < I, we obtain 
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This establishes (47). Next, we have the bound for all j ' > 1: 



(48) <|LC (x, y )|(i= + C 3t 1 /^ e - 2 (^)/( 2 *) 

<|LC (x, y )|A e - 2 (^)/( 2i ). 

In the light of (48) and (47), the pointwise limit 

p(x,y,t) = lim pj(x,y,t) 

exists onlxKx (0, to). In addition, p(x,y,t) satisfies the limiting equation 

rt roo 

= K*(x,y,t) + p(x,y,t) + / / n*(x,z,t-T)p(z,y,T)dzdT, 
and indeed 



./-oo 



/•f /'CXI 

(49) K(x,y;t) — K(x,y;t) = / / K(x,z,t — r)p(z,y,r) dzdr. 

Jo J-oo 

In order to see this, we can apply directly the arguments of Section 5 of [8] 
in the case N = 0; see also Section 1.3 of [14]. Hence, we can take the limit 
in (48) to conclude 

(50) \p(x,y,t)\ <2|LC (x,y)|t- 1 / 2 e- s2 (^/( 2 ') 
for t G (0, to]. The claim of the lemma then follows from 

\n{x,y;t) -k(x, y;t)\ 

rt roo 

< / k(x, z, t — r)\p(z, y, t)\ dz dr 

JO J-oo 

f t roo e -s 2 (x,z)/(2(t-r)) -s 2 (z,y)/(2r) 

<2 / j== C (x,z) -= \LC (z,y)\dzdT 

Jo J-oo Vt-T V r 

<^=C (x,y) / 6 - _ lL f Zl dzdr 

V27T Jo J-oo \ft~T V Cr ( 2 r q ( Z > 

<2C (x,y)tV4 e - 2 (^)/( 2t )£i^ = C3Co(X)y ^i/4 e -^(^)/( 2 t) > 

V27T 



KERNEL DENSITY ESTIMATION VIA DIFFUSION 



37 



APPENDIX C: PROOF OF THEOREM 1 

Note that (18) is given by K,(x,y;t)f(y) dy — f(x), and from (11) we 
have 

d f 

gl9(x;t)= f(y)L K{x,y;t)dy 

1 d f f (y') \ f 

2 dy \p(y) J y ed3t J x 

Given that 2£ = R, Lemma 1 gives k(x, y; t)\ y& g^ ~ k(x, y; i))^!^, 1 1 0. 
The last term is zero since for fixed x, 

2 



lim 

y— S-±oo 



' P(g) 

o(s) 



r/.s 



■ oo, 



and hence lim^-too K,(x,y;t) = 0. We have 

t) = g(x; 0) + t ^g(x; t)\ + 0(t 2 ), 

because g(x;t),t > is smooth (see, e.g., Theorem IV- 10 • 1 in [35]). There- 
fore, 

g(x;t) = f(x) + tLf(x) + 0(t 2 ), 

and (18) and (19) follow. We now proceed to demonstrate (20). First, the 
second moment has the behavior 



E f [K 2 {x,Y;t)] 



3C 



f(y)K 2 {x,y-t)dy. 



p 2 {x) 



f(y) 



f(y)K 2 (x,y;t)dy 



2irtyjp(x)a(x) J-oo y/p{y)a(y) 



-l/2{^27tfy^p(s)/a(s)ds}< 



dy. 



We can simplify the last expression by the change of variable u= \/ j x 



E V$y ds - This gives 



p 2 (x) 



2vr v / 2tv / P( x ) a ( 3 



p{y{u,t)) 



where y(u, t) = y(u, 0) + >/t^|t=o + 0(t) = x + 



u 



ta(x) 



+ 0(t) is a Taylor 



2p(x) 

expansion of y(u, t) at \ft = 0. Therefore, ^fe^ ~ ^jfy as i 1 0, and 



p 2 (x) 



2-Kyj2tyJp(x)a{x) J-oo P(y{u, t)) 



f(y(^)l e -uV2 dui 



1 



2v/vrt 



/(*) 



l p(x) 
a(x) 



no. 
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Hence, from (9) we have 

±E f [ K \x,Y;t)]-± 
fix 



Var f \g(x;t)] = -E f [ K \x,Y;t)] - -E f [n(x,Y;t)] 2 



2NVTria(x) 
from which (21) and (20) follow. 



HO, 



APPENDIX D: CONSISTENCY AT BOUNDARY 

As in [53], we consider the case where the support of / is [0, oo]. The con- 
sistency of the estimator near x = is analyzed by considering the pointwise 
bias of estimator (9) at a point xn such that xn is 0{y/t^) away from the 
boundary, that is, xn is approaching the boundary at the same rate at 
which the bandwidth is approaching 0. We then have the following result, 
which shows that the diffusion estimator (9), and hence its special case (3), 
is consistent at the boundaries. 



Proposition 6. Let X = [0, oo], and assume that x = xjy = cty/tN for 
some constant a S [0,1], where lim7v-s.oo tN = and lim7v->oo ^VtN = oo. 
Then for the diffusion estimator (9) we have 

E f g(x N ;t)=f(x N ) + 0(Vt^), N ^ oo. 

Hence, the diffusion estimator (9) is consistent at the boundaries. 

PROOF. First, we differentiate both sides of Efg(x; t) = f f(y)K(x; y; t) dy 
with respect to t and use (11) to obtain 



roc q 

%fl r (a?l*) = J f{y)-^K{x;y;t)dy 
f(y)L*K(x;y;t)dy 



Ap(y)J 



y=oo 

a(y)n(x;y,t) + / n(x;y;t)Lf(y) dy. 

y=0 



Second, we show that ^^(a^/tN^, 0; ijv) = 0(t 1 ^ 2 ) and lim^oo n{a\ft^;y; ijv) - 
o(l), and f Q k(x; y; trq)Lf{y) dy = 0(1) as N — > oo. To this end, we consider 
the small bandwidth behavior of k. It is easy to verify using Lemma 1 that 
the boundary kernel 

K B (x,y;t) = K{x,y;t) + k(x, -y;t) 
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satisfies 

^k b (x, y; t) = L* KB (x, y; t) + ( e - 2 (* l w)/(2t) t -i/2) j t ; , 

on x,y £M with initial condition KB(x,y;0) = 5(x — y). In addition, the 
boundary kernel satisfies the condition J^kb(x, y; t)\ y= o = 0, and therefore 
Kb describes the small bandwidth asymptotics of the solution of the PDE 
(11) on the domain x,y £ [0, oo) with boundary condition J^k(x, y; t)\ y= o = 
0. Hence, we have 

K{aVi;0;t) ~ KB (aVi;0;t) = const. t~ 1/2 e°^\ t±0, 

and 



lim KsiocVi; y; t) = 0, t > 0. 

y— >oo 



Therefore, 

d 



or 



% M f g(x N ;t N ) = o(l) - 0(^ 1/2 ), iV -> oo, 

hO(tjv) = 0{t N ) + 0(1), iV^oo, 

which, after rearranging, gives 

E■fg(x N ^,t N ) = f(x N ) + 0(^/iN), A^oo. □ 

APPENDIX E: BANDWIDTH SELECTION IN HIGHER DIMENSIONS 

Algorithm 1 can be extended to two dimensions for the estimation of a 
p.d.f. /(x) on M 2 . Assuming a Gaussian kernel 

where x= [rri,:^] 2 " and y = [yi,y2] T , the asymptotically optimal squared 
bandwidth is given by ([53], page 99) 

f = (2^V(Vo,2 + ^2,0 + 2Vi,i)r V3 , 

where 

^i = (-ir +J / /(x) 2j /(x)rfx, i,i€N+, 



Sxf 1 <9x 9 

(51) 

J \ dx\ dxP 2 J 
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Note that our definition of tp differs slightly from the definition of tp in [53] . 
Here the partial derivatives under the integral sign are applied 2(i+j) times, 
while in [53] they are applied (i + j) times. Similar to the one-dimensional 
case, there are two viable plug-in estimators for tpi j. The first one is derived 
from the first line of (51): 



(52) 



N 2 



EE 



k=lm =l dx l dx 2 

and the second one is derived from the second line of (51): 

N N - 



j\[2 E E 



(53) 



fc=l m=l 



dx\ dx\ ' dx\ dx\ 



x,X fc ;ti )3 -)dx 



AT2 



The asymptotic expansion of the squared bias of estimator ipij is given by 
([53], page 113) 



(54) 
where 

Thus, we have 
(55) 



(E/hM-^ij) 



™3 



+ 



(-i) j 
i 



1 x 3 x 5 x ••• x (2j - 1) 



2vr 



i>i, 
i = o. 



g(»)g(j) 

Af(2t iJ ) J +i+ 1 



iV->oo. 



For both estimators the squared bias is the dominant term in the asymptotic 
mean squared error, because the variance is of the order 0(N^ 2 t^ 2l ~ 2 ^~ l ). 
It follows that both estimators will have the same leading asymptotic mean 
square error term provided that 



1 i o— i— .7— 1 



l/(2+i+j) 



KERNEL DENSITY ESTIMATION VIA DIFFUSION 



41 



We estimate t$ , via 



l + 2^'- 1 -2q{i)q{j) \ 1 /( 2 + i +i) 



(57) 4j 

Thus, estimation of ijjij requires estimation of V'ij'+i an d tpi+ij^ which in 
turn requires estimation of tpi^.j-i^i+lj+l, ^i,j+2 and so on applying for- 
mula (57), recursively. Observe that to estimate all tpij for which i + j = k, 
that is, {tftij :i + j = k},we need estimates of all {ipi,j '■ i + j = k + 1}. For ex- 
ample, from formula (57) we can see that estimation of £2,o>^l,i)io2 requires 
estimation of h j0 ,h,i,h,2,to,3- 

For a given integer k > 3, we define the function 7(f) as follows. Given an 
input t > 0: 

1. Set tjj = t for all i + j = k. 

2. Use the set {tij :i + j = k} to compute all functionals {ipij :i + j = k} 
via (53). 

3. Use {ipij :i+ j = k} to compute {tij :i+ j = k — 1} via (57). 

4. If k = 2 go to step 5; otherwise set k := k — 1 and repeat from step 2. 

5. Use {^j j : i + j = 2} to output 

7 (t) = (2^V(^o,2 + ^2,0 + 2^ 1:1 ))- 1/3 . 

The bandwidth selection rule simply consists of solving the equation 7(f) = t 
for a given k > 3 via either the fixed point iteration in Algorithm 1 (ignoring 
step 4) or by using Newton's method. We obtain excellent numerical results 
for k = 4 or k = 5. Higher values of k did not change the value of t in any 
significant way, but only increased the computational cost of evaluating the 
function j(t). Again note that this appears to be the first successful plug- 
in bandwidth selection rule that does not involve any arbitrary reference 
rules, but it is purely data-driven. An efficient Matlab implementation of 
the bandwidth selection rule described here, and using the two-dimensional 
discrete cosine transform, can be downloaded freely from [4]. The Matlab 
implementation takes an additional step in which, once a fixed point of 7(f) 
has been found, the final set of estimates {rfij + j = 2} is used to compute 
the entries \ftx[ and y/tx 2 °f the optimal diagonal bandwidth matrix ( [53] , 
page 111) for a Gaussian kernel of the form 

1 p -^i-Vi? /(2t Xl )-(x 2 -y 3 ) 2 /(2t X2 ) 

2-KyJt Xl tx 2 

These entries are estimated via the formulas 



txi 



^3/4 1/3 



4ttA^o 4 (V'i,i + ^2,0^0,2) 



42 



Z. I. BOTEV, J. F. GROTOWSKI AND D. P. KROESE 



and 

,23/4 1/3 
^2,0 



4vrA^ 2 4 ('0i i i + ^2,0^0,2) 
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